fix(cpu-ops): lazy transpose for Q8_0 packed tensors #736

Merged
michalharakal merged 1 commit into develop from fix/q8_0-lazy-transpose
Jun 15, 2026

@michalharakal

Contributor

Problem

DefaultCpuOps.transpose rewraps packed bytes with a flipped shape for the K-series (Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but Q8_0 falls through to the generic FP32 DenseTensorDataFactory path, which casts the Byte-backed buffer to Float and throws:

ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Float
  at DenseTensorDataFactory.init → DefaultCpuOpsBase.transpose → linearProject

This prevents a Q8_0 matmul weight from staying packed through linearProject, which computes matmul(x, transpose(W)).
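The failure mode can be reproduced in isolation. The sketch below is not SKaiNET code: `toFloatArray` is a hypothetical helper that, like the generic FP32 path described above, assumes every element is a Float. Handing it boxed Byte values triggers the same kind of ClassCastException on the JVM.

```kotlin
// Hypothetical helper, not a SKaiNET function: it treats every element as a Float,
// which is only valid for FP32-backed data.
fun toFloatArray(values: List<Any>): FloatArray =
    FloatArray(values.size) { i -> values[i] as Float }

fun main() {
    // Stand-in for a Byte-backed packed buffer (elements are boxed java.lang.Byte).
    val packed: List<Any> = listOf<Any>(12.toByte(), (-3).toByte())

    // The cast inside toFloatArray fails on the first element.
    val failure = runCatching { toFloatArray(packed) }.exceptionOrNull()
    println(failure is ClassCastException)
}
```

A dedicated branch for the packed type avoids the cast entirely, because the bytes never go through the float path.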

Fix

Add the analogous case, is Q8_0TensorData -> Q8_0BlockTensorData(Shape(cols, rows), d.packedData) (one line plus an import). The packed bytes stay in the kernel's [out, in] block-major layout, so this is a metadata-only (lazy) transpose like the others: only the shape is flipped and no bytes are moved.
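A minimal, self-contained sketch of the idea follows. The types `Shape`, `TensorData`, `Q8_0Packed`, and `DenseF32` are hypothetical stand-ins invented for illustration, not the SKaiNET classes. The sketch contrasts a metadata-only transpose for packed data with an element-moving transpose for dense FP32 data.

```kotlin
// Hypothetical stand-in types for illustration only; not the SKaiNET API.
data class Shape(val rows: Int, val cols: Int)

sealed interface TensorData { val shape: Shape }

class Q8_0Packed(override val shape: Shape, val packedData: ByteArray) : TensorData
class DenseF32(override val shape: Shape, val values: FloatArray) : TensorData

fun transpose(d: TensorData): TensorData = when (d) {
    // Lazy case: reuse the same ByteArray and only swap the shape metadata.
    is Q8_0Packed -> Q8_0Packed(Shape(d.shape.cols, d.shape.rows), d.packedData)
    // Dense case: physically move every element into a new row-major buffer.
    is DenseF32 -> {
        val (r, c) = d.shape
        val out = FloatArray(r * c)
        for (i in 0 until r) for (j in 0 until c) out[j * r + i] = d.values[i * c + j]
        DenseF32(Shape(c, r), out)
    }
}

fun main() {
    // 2 x 32 weights = 2 blocks of 34 bytes, assuming the GGML Q8_0 block size.
    val w = Q8_0Packed(Shape(2, 32), ByteArray(2 * 34))
    val wt = transpose(w) as Q8_0Packed
    println(wt.shape)                       // Shape(rows=32, cols=2)
    println(wt.packedData === w.packedData) // true: no bytes were copied
}
```

The sketch only holds if the consuming kernel interprets the bytes according to its own fixed convention, which is what the PR states for the [out, in] block-major layout.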

Why it matters

Lets FunctionGemma's tied Q8_0 lm_head stay packed in the eager NATIVE_OPTIMIZED path instead of being dequantized to FP32 (~0.67 GB), which runs the 1.9 GB Astra Machina SL2610 out of memory.
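The size difference can be estimated with a short calculation. Two assumptions are made here that the PR does not state: the lm_head is taken to be 262,144 x 640, chosen only because it reproduces the ~0.67 GB FP32 figure, and the packed format is taken to follow the GGML Q8_0 block layout (32 int8 values plus one 2-byte fp16 scale, 34 bytes per block).

```kotlin
// Assumed GGML Q8_0 block layout: 32 int8 quants + one fp16 scale = 34 bytes.
const val BLOCK_ELEMS = 32L
const val BLOCK_BYTES = 34L

// Bytes needed to hold the weights as FP32 (4 bytes per element).
fun fp32Bytes(elements: Long): Long = elements * 4L

// Bytes needed to hold the weights as packed Q8_0 blocks.
fun q8_0Bytes(elements: Long): Long {
    require(elements % BLOCK_ELEMS == 0L) { "element count must be a multiple of 32" }
    return elements / BLOCK_ELEMS * BLOCK_BYTES
}

fun main() {
    // Assumed lm_head dimensions (vocab x hidden); not stated in the PR.
    val elements = 262_144L * 640L
    println(fp32Bytes(elements)) // 671088640 bytes, about 0.67 GB
    println(q8_0Bytes(elements)) // 178257920 bytes, about 0.18 GB
}
```

Under these assumptions, keeping the weight packed saves roughly 0.49 GB, a large share of a 1.9 GB board's memory.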

Verification

SKaiNET-transformers GemmaQ5KPackedParityTest (composite -PuseLocalSkainet=true) now packs the lm_head as Q8_0 and decodes byte-identically to the FP32 baseline. See SKaiNET-transformers #178.

🤖 Generated with Claude Code

ops.transpose rewraps the packed bytes with a flipped shape for the K-series
(Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but Q8_0 fell through to the generic FP32
DenseTensorDataFactory path, which casts the Byte-backed buffer to Float and
throws ClassCastException. Add the analogous Q8_0BlockTensorData case.

This unblocks keeping a Q8_0 matmul weight packed through linearProject
(matmul(x, transpose(W))), notably FunctionGemma's tied Q8_0 lm_head, which
otherwise has to be dequantized to FP32 (~0.67 GB) and OOMs the 1.9 GB SL2610 board.

Verified: SKaiNET-transformers GemmaQ5KPackedParityTest (eager load(NATIVE_OPTIMIZED))
now packs the lm_head as Q8_0 and decodes byte-identically to the FP32 baseline.
See SKaiNET-transformers#178.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@michalharakal merged commit cd2bfd2 into develop on Jun 15, 2026
6 checks passed
@michalharakal deleted the fix/q8_0-lazy-transpose branch on June 15, 2026, 11:54